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Abstract 

This paper describes a massively parallel algebraic multigrid method 
based on non-smoothed aggregation. It is especially suited for solving 
heterogeneous elliptic problems as it uses a greedy heuristic algorithm 
for the aggregation that detects changes in the coefficients and prevents 
aggregation across them. Using decoupled aggregation on each process 
with data agglomeration onto fewer processes on the coarse level, it weakly 
scales well in terms of both total time to solution and time per iteration to 
nearly 300,000 cores. Because of simple piecewise constant interpolation 
between the levels, its memory consumption is low and allows solving 
problems with more than 10" de grees of freedom. 

Key words: algebraic multigrid, parallel coniputing, preconditioning, HPC, 
high-performance-computing 
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1 Introduction 

When solving elliptic or parabolic partial differential equations (PDEs) most 
of the computation time is often spent in solving the arising linear algebraic 
equations. This demands for highly scalable parallel solvers capable of running 
on recent supercomputer. The current trend in the development of high perfor- 
mance supercomputers is to build machines that utilize more and more cores 
with less memory per core, but interconnected with low latency networks. To 
be able to still solve problems of reasonable size the parallel linear solvers need 
to be (weakly) scalable and have a very small memory footprint. 

Besides domain decomposition methods the most scalable and fastest meth- 
ods are multigrid methods. They can solve these linear systems with optimal 
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or nearly optimal complexity, i.e. at most 0(7VlogA^) operations for N un- 
knowns. Among them algebraic multigrid methods (AMG) are especially suited 
for problems with heterogeneous or anisotropic coefficient tensors on unstruc- 
tured grids. They build a hierarchy of matrices using their graphs and thus 
adapt the coarsening to the problem solved. 

Parallel geometric multi-grid implementations exist since at least 25 years, 
sec e.g. [16]. Since about 15 years, several parallel algebraic multigrid codes 
have been developed [15, 29, 20, 3, 18]. Classical AMG [26] divides the fine level 
unknowns into two groups: the ones also represented on the coarse level, and 
the ones that exist only on the fine level. In its parallel version the splitting 
of the unknowns has to be globally consistent. This makes the coarsening of 
unknowns, that are either also represented on other processes or adjacent to 
unknowns on them, not only difficult but also inherently sequential near the 
inter-process boundary. It causes more communication and increases operator 
complexity (sum of the number of nonzeros of the matrices on all levels divided 
by the number of nonzeros of the fine level matrix). Therefore, especially in 
3D, the time needed for building the operator hierarchy and the time needed to 
perform one iteration increase with the problem size and the number of cores 
used, albeit the number of iterations needed for convergence stays optimal. 
Adapted coarsening heuristics and aggressive coarsening strategies have been 
developed to partly overcome this problem [28, 13, 1]. 

AMG based on aggregation, see [12, 30, 25], clusters the fine level unknowns 
into aggregates. Each aggregate represents an unknown on the coarse level 
and its basis function is a linear combination of the fine level basis functions 
associated with the aggregate. Two main classes of the method exist. Non- 
smoothed aggregation AMG, see [12, 25, 22, 8], which uses simple piecewise 
constant interpolation, and smoothed aggregation AMG, that increases inter- 
polation accuracy by smoothing the tentative piecewise constant interpolation. 
For the parallel versions of both classes no growth in operator complexity is 
observed for increasing numbers of processes [29, 22, 8]. Still the smoothing of 
the interpolation operators increases the stencil size of the coarse level matrices 
compared to the non-smoothed version. Moreover, the non-smoothed version 
can be used in straight forward way for many systems of PDEs, see [8] . 

In this paper we describe a parallel AMG method that uses a greedy heuristic 
algorithm for the aggregation based on a strength of connection criterion. This 
allows for building round aggregates of nearly arbitrary size that do not cross 
high contrast coefficient jumps. We use simple piecewise constant interpolation 
between the levels to prevent an increase of the size of the coarse level stencils. 
Together with an implementation of the parallel linear algebra operations based 
on index sets this makes the algorithm very scalable regarding the time needed 
per iteration. Even though the number of iterations needed for convergence do 
increase during weak scalability tests, the time to solution is still very scalable. 
We present numerical evidence that the approach is scalable up to 262,144 
cores for realistic problems with highly variable coefficients. At the same time 
the memory requirement of the algorithm is far less than that of classical AMG 
methods. This allows us to solve problems with more than 10^^ degrees of 
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unknowns on an IBM Blue Gcnc/P using 64 racks. 

We will start the paper in the next section with a description of the algebraic 
multigrid method together with our heuristic greedy aggregation algorithm for 
coarsening the linear systems. In Section 3 wc describe the parallclization of 
the algebraic multigrid solver and its components, namely the data decomposi- 
tion, smoothers, interpolation operators, and linear operators. After presenting 
implementational details about the parallclization and linear algebra data struc- 
tures in Section 4, we conduct scalability tests of our method on an IBM Blue 
Gene/P and an off-the-shelf multicore Linux cluster in Section 5. Our summary 
and conclusions can be found in Section 6. 

2 Algebraic Multigrid 

For any finite index set J C N wc define the vector space to be isomorphic 
to RI'^I with components indexed by i G /. Thus x S can be interpreted 
as a mapping x : 7 — > M and Xi = x(z). In the same way, for any two finite 
index sets /, J C N we write A S R^^"^ with the interpretation A : / x J — > R 
and Uij = A{i,j). Finally, for any subset /' C / we define the restriction 
matrix Rjj' : R^ — >• R^ as (R/,/'x)i = Xi "ii € I' (which corresponds to simple 
injection). 

On a given domain we are interested in solving the model problem 

V- (K(,t)Vw) = /, onn (1) 

together with appropriate boundary conditions. Here, the symmetric positive 
definite tensor K(a::), dependent on the position x within the domain fl, is allowed 
to be discontinuous. Given an admissible mesh Th that for simplicity resolves 
the boundary and possible discontinuities in the tensor K(a;), discretizing (1) 
using conforming lowest order Galerkin finite element or finite volume methods 
yields a linear system 

Ax = 6, (2) 

where A : R" — >■ R" is the linear operator, and x,b € R" are vectors. For an 
extension to discontinuous Galerkin methods see [6]. We strive to solve this 
linear system using our algebraic multigrid method described below. 

The excellent computational complexity of multgrid methods is due to the 
following main idea. Applying a few steps of a smoother (such as Jacobi or 
Gauss-Seidel) to the linear system usually leads to a smooth error that cannot 
be reduced well by further smoothing. Given a prolongation operator P from a 
coarser linear system, this error is then further reduced using a correction u'^°'^'^^<^'^ 
on a coarser linear system p'^APu'^"'^'''^*^ = P-^(b — Ax). We use the heuristic 
algorithm presented in Subsection 2.1 to build the prolongation operator P. If 
the system is already small enough, we solve it using a direct solver. Otherwise 
we recursively apply a few steps of the smoother and proceed to an even coarser 
linear system until the size of the coarsest level is suitable for a direct solver. 
After applying the coarse level solver, we prolongate the correction to the next 
finer level, add it to the current guess, and apply a few steps of the smoother. 
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2.1 Coarsening by Aggregation 

To define the prolongation operator P we rely on a greedy and heuristic aggre- 
gation algorithm, that uses the graph of the matrix as input. It is an extension 
of the version published by Raw (of. [24]) for algebraic multigrid methods (see 
also [27]). 

Let G = (V, £) be a graph with a set of vertices V and edges £ and let 
ws ■ f ^ K and wy : V — > K be positive weight functions. For the examples in 
this paper W£{{i, j)) = \{aij — \ is used, i.e. for positive off-diagonals and 
the absolute value otherwise, and wv{i) = an- For matrices arising from the 
discretisation of systems of PDE, for which our aggregation scheme is applicable 
as well, wg and wy could e.g. be the row-sum norm of a matrix block (see e.g. 
[8]). These functions are used to classify the edges and vertices of our graph. 
Let 

N{i) := {j GV\3{j,i)G£} 
be the set of adjacent vertices of vertex i and let 

._(*) := max Mik,^))we{{^ k)) _ 
feeiv(j) wv{t) wv{k) 

(a) An edge (j, i) is called strong, if and only if 

ws{{i,j))ws{{j,i)) 



wv{i) wv{j) 



> 6 min(r?max(i) U)), (4) 



for a given threshold < 6 < 1. We denote by Ns{i) C N{i) the set of all 
vertices adjacent to i that are connected to it via a strong edge. 

(b) A vertex i is called isolated if and only if ?7inax(*) < /?, for a prescribed 
threshold < /3 < 1. We denote by ISO{V) C V the set of all isolated 
vertices of the graph. 

For symmetric positive definite M-matrices arising from problems with con- 
stant diffusion coefficients our strength of connection criterion is similar to the 
traditional ones for the AMG of Ruge and Stiiben [26] and for smoothed ag- 
gregation [31]. For non-symmetric matrices or problems with discontinuous 
coefficients it differs from them. It is especially tailored for the latter. At the 
interfaces of the jumps the Ruge Stiiben criterion might classify a connection be- 
tween two vertices strong in one direction and weak in the other one. Actually, 
no aggregation should happen across this interface. The smoothed aggregation 
criterion falsely classifies positive off-diagonal values as strong while ours does 
not do this with an appropriate weight function. For more details see [8] . 

Our greedy aggregation algorithm is described in Algorithm 1. 
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Algorithm 1 Build Aggregates 

procedure AGGREGATI0N(V, £, Smin, Smax, C^max) 

f V \ ISO{y) t> First candidates are non-isolated vertices 

/ i> Coarse index set 

S ■(-{ueU : #A^„a(w) < #Afna('i«) Vw € [/} Seed stack 

Pop seed v from S 
while C/ 7^ do 

flu •(— {w} i> Initialize new aggregate 

U -^UXa^ 
I ^ lU{v} 

GROWAGGREGATE(aj;, V, £, Smin, ^max, U) 
ROUND AGGREGATE(a„, V, £, Smax, U) 

if = 1 then > Merge one vertex aggregate with neighbor 

C -S— {ttj : j € I \ {v} and 3w; € with w G Ns{v)} 
if C ^ then 
Choose Qk & C 
I^I\{v} 
Ofe afe U a„ 
end if 
end if 

S -S- {w : w e iV(a„)} 
if [/ ^ then 

if 5 = then 

S^{u€U: #7V„a(w) < #N^e.{w) yw e [/} 

end if 

Pop seed v from S 
end if 
end while 

U //S'(3(V) £> Aggregate isolated vertices 

while Uj^Udo 

Select arbitrary seed v gU 

a-v {f } 
U ^U\ay 

I <-I\j{v} 

GROWlSOAGGREGATE(a, V, £, Smin, (imax, C^) 

end while 

A -i^ {ai : i e 1} 
return {A, I) 
end procedure 



Until all non-isolated vertices arc aggregated, we start a new aggregate av 
with a non-isolated vertex v. To select this vertex we use a finite stack. When- 
ever we try to pop a vertex from an empty stack, we fill it with the vertices, 
which have the least number of non-aggregated neighbors iVna- Each time an 
aggregate is finished, all non-isolated non-aggregated neighbors are pushed onto 
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the stack. The index of the seed vertex is associated with this new aggregate 
and added to the index set /. The algorithm returns both the index set / for 
the set of aggregates as well as the set A = {a^ : i G 7} of all aggregates it has 
built. 

The first step in the construction of an aggregate in Algorithm 1 is to add 
new vertices to our aggregate until we reach the minimal prescribed aggregate 
size Smin- This is outlined in Algorithm 2. 



Algorithm 2 Grow Aggregate Step 



function GROWAGGREGATE(a, V, £, Smin, rfmax, U) 

while #a < Smin do t> Makes aggregate a bigger until its size is Smin 
Co -5— {w G N{a) : diam(a, v) < (imax} > Limit the diameter of the 
aggregate 

Ci G Co : cons2(v,a) > cons2(w, a) Vw G N{a)} 

if Ci = then > No candidate with two-way connections 

Ci ■(— {« G Co : consi(w,a) > consi(w, a) Vw G N{a)} 
end if 

if #Ci > 1 then > More than one candidate 

r< J I^, ^ n . connect(v,a) \ connect(tu,a) w ^ ,^ 1 

end if 

if #Ci > 1 then t> More than one candidate 

Ci {w G Ci : neighbors(w, a) > neighbors(w, a) G Ci} 
end if 

if Ci = then break 
end if 

Select one candidate c G Ci 

a a U {c} > Add candidate to aggregate 

U^U\{c} 
end while 
end function 



When adding new vertices, we always choose a vertex within the prescribed 
maximum diameter rfmax of the aggregate which has the highest number of 
strong connections to the vertices already in the aggregate. Here we give prefer- 
ence to vertices where both edge and edge are strong. The functions 
consi(t;,a) and cons2(t;,a) return the number of one-way and two-way strong 
connections between the vertex v and all vertices of the aggregate a, respectively. 

If there is more than one candidate, we want to choose a vertex with a high 
proportion of strong connections to other vertices not belonging to the current 
aggregate, while favoring connections to vertices which belong to aggregates that 
are already connected to the current aggregate. We therefore define a function 
connect(i;, a), which counts neighbors of v that are not yet aggregated or belong 
to an aggregate that is not yet connected to aggregate a once and neighbors of 
V that belong to aggregates that are already connected to aggregate a twice. 

If there is still more than one candidate which maximizes , we 
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choose the candidate, which has the maximal number neighbors(w, a) of neigh- 
bors of vertex v that are not yet aggregated neighbors of the aggregate a. This 
criterion tries to maximize the number of candidates for choosing the next ver- 
tex. 



Algorithm 3 Round Aggregate Step 

function RoundAggregate(o, V, £, s,„ax, U) 

while #a < Smax do > Rounds aggregate a while size < Smax 

D {w e N{v) n U : cons2(w, a) > or consi(w, a)} > 

C ^{veD: 4l={N{v) nu}> #{A^(w) n U}} 
Select arbitrary candidate c S C 

o a U {c} t> Add candidate to aggregate 

U ^U\ {c} 
end v^rhile 
end function 



In a second step we aim to make the aggregates "rounder" . This is sketched 
in Algorithm 3. We add all non-aggregated adjacent vertices that have more 
connections to the current aggregate than to other non-aggregated vertices un- 
til we reach the maximum allowed size .Smax of our aggregate. The function 
cons(w, C) returns the number of strong connections between vertex v and the 
members of the set C. 

If after these two steps an aggregate still consists of only one vertex, we try 
to find another aggregate that the vertex is strongly connected to. If such an 
aggregate exists, we add the vertex to that aggregate and choose a new seed 
vertex. 

Finally, once all the non-isolated vertices are aggregated, we try to build 
aggregates for the isolated vertices. Where possible, we build these by ag- 
gregating adjacent isolated vertices that have at least one common neighbor- 
ing aggregate consisting of non-isolated vertices. This is done in the function 
GrowIsoAggregate which we do not present here. The most common situ- 
ation where connected isolated vertices occur is when Dirichlet boundary con- 
ditions are represented as unknowns in linear systems. In this case their ag- 
gregation does not introduce any errors, but prevents the stencils of the coarse 
levc;l matrices from filling up and thus leads to reasonable coarsening rates and 
operator complexities. Other situations of such occurrences are very rare. 

Given the aggregate information A, we define the piecewise constant prolon- 
gation operator P by 

and define the coarse level matrix using a Galerkin product as 

1 

^coarse pi y\^p 

Note that the overrelaxation factor lo is needed to improve the approximation 
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properties of the coarse correction. According to [12] w = 1.6 is a good default 
and often sufficient for good convergence. 



3 Parallelization 

3.1 Data Decomposition and Local Data Structures 

The most important and computationally expensive part in both algebraic 
multigrid methods and many iterative solvers (especially Krylov methods and 
stationary iterative methods) is the application of linear operators. Therefore 
their construction is crucial. It has to be made in a way that allows for efficient 
application of the linear operator as well as the construction of preconditioners 
or smoothers from these linear operators. 

Let / C N be our finite index set and UpeP-^(p) ^ (non-overlapping) 
partitioning of the index set between the processes V participating in the com- 
putation. This partitioning might be given by an external partitioning software 
or by a parallel grid manager used for e.g. a parallel finite element discretiza- 
tion. In some cases these tools may not provide a partitioning but an overlapping 
decomposition. In this case one can easily compute; a partitioning following [10]. 

Unfortunately, using such a partitioning and storing on process p only the 
subset of entries of the global vector and rows of the global matrix that is 
associated with its part of the index set J(p) , would mean triggering one com- 
munication for each matrix entry a^, with i e /(p) and j ^ /(p). Our goal 
is to separate the communication from the application of the linear operator. 
Therefore each process p stores data associated with a larger index set 



'(p) 



/(p) U {j Gl\3iG I(p) : aij ^ 0} , 



namely A(p) S MAp)^Ap) and X(p),b(p) £ RAp). We call X(p) stored consistently 
if X(„) = Rr r X for all p gV. If x/„) = R~ ^ R/ /, , x holds for all p G "P, we 

yi-; -')-'(p) /(p),/(j,) ' (P) 

denote X(p) as uniquely stored. 

Let A G K^^^ be a global linear operator. Then on process p the local linear 
operator A(p) stores the values 



A(p)(«,j) 



A{i,j) if i e /(p) 



else 



(6) 



ij, 



(p) 



where Sij denotes the usual Kronecker delta. Denoting A(p) — Rj j^^^AR 
and reordering the indices locally, such that i < j for all i € /(p) and j ^ 
/(p) \ J(p) , A(p) has the following structure: 



'(p) 



A(p) 


* 
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Using this storage scheme for the local linear operator A|-p-) and applying 

it to a local vector X(p), stored consistently, (A(p)X(p))(i) = (Ax)(i) holds for 
all i e /(p). Therefore the global application of the linear operator can be 
represented by computing 

A-=E^^/(.)^7(.„/(jA(,)i?,_,^^x) . (7) 
pev 

Here the operators in front of the brackets represent a restriction of the re- 
sults of the local computation to the (consistent) representation on R^(p) then 
a prolongation to global representations and a summation of all these global 
representations. In contrast to the notation used there is no global summation 
and thus no global communication needed. It suffices that every process adds 
onlj entries from other processes that actually store data associated with indices 
in 7(p) . Therefore this represents a next neighbor communication followed by a 
local summation. 

3.2 Pcirallel Smoothers 

As smoothers we only consider so-called hybrid smoothers [21]. These can be 
seen as block- Jacobi smoothers where the blocks are the matrices ^(p) . Instead 
of directly solving the block systems a few steps of a sequential smoother (e.g. 
Gauss-Seidel for hybrid Gauss-Seidel) is applied. We always use only one step. 
Let M(p) e R^(p)^^(p), p £ V, he the sequential smoother computed for 

matrix A(p), and d(p) the defect. Then the consistently stored update V(p) is 
computed by applying the parallel preconditioner as 

Again as in the parallel linear operator 7 the summation requires only a com- 
munication with processes, which share data associated with 7(p) , and thus can 
be handled very efficiently. 

3.3 Parallel Coarsening 

The parallelization of the coarsening algorithm described in Section 2.1 is rather 
straightforward. It is simple and massively parallel since the aggregation only 
occurs on vertices of the graph of matrix A(p). Using this approach, the coars- 
ening process will of course deal better with the algebraic smoothness if the 
disjoint matrix is split along weak edges. 

The parallel approach is described in Algorithm 4. It builds the aggre- 
gates A(^p) of this level and the parallel index sets 1^°^''^'^ for the next level 
in parallel. JThe^^parameters are the edges and vertices of the matrix graph 
G'(A(p)) = (/(p),£(p)) and the disjoint index set /(p). The rest of the parame- 
ters are the same as for the sequential Algorithm 1. As a first step a subset 
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(7(p),f(p)) of the input graph that corresponds to the index set /(p) is created. 
Then the sequential aggregation algorithm is executed on this sub-graph. Based 
on the outcome of this aggregation a map between indices and corresponding 
aggregate indices is built and the information is published to all other processes 
that share vertices of the overlapping graph. Now every process knows the ag- 
gregate index of each vertex of its part of the overlapping graph and constructs 
the overlapping coarse index set and the aggregates. Note that this algorithm 
only needs one communication step per level with the direct neighbors. 



Algorithm 4 Parallel Aggregation 

procedure PARALLElAGGREGATION(/(p), f(p), Smin, Smax, rfmax) 

On process p G V: 

f(p) <— {{k. I) e £(p) I k e /(p) A I E /(p)} Only edges within 7(p) 

(•jcoarsc^_4^^^^ ^ AGGREGATION(/(p) , f (j,) , 

^miri) 5max) C^max) 

for ttfe G A(p) do 
end for 

ni(p) ^ Rj J J2nev ^ Communicate aggregates mapping 

' (p) ^ (q) 

jcoarso ^ |^ | ^ j^^^ ^-^j^ (-^/ J ^ip))j = k} > Build coarse index set 
for k e 7^')'*''^'= do 

ak ^ {j e I{p) I {R'fr S(p))j- = A:} 
end for ^ 

"^(p) {"^fe ■ ^ € ^(^pf'^''''} Aggregate information 

return (7g'p,Ap)) 
end procedure 



For each aggregate on process p, that consists of indices in /(p) \ 7(p) on 
the fine level, the child node, representing that aggregate on the next coarser 
level, is again associated with an index i £ au C I(p) \ I(p) . This means that for 
all vertices in /(p) on the coarse level all neighbors they depend on or influence 
are also stored in process p. ^ 

The local prolongation operator P(p) is calculated from the aggregate in- 
formation ^(p) in accordance to (5). Let A'^^ be the local fine level matrix, 
then the tentative coarse level matrix is computed by the Galerkin product 
~ ^fp)^[p)^'-P^' satisfy the constraints of our local operators (6), we 
need to set the diagonal values to 1 and the off-diagonal values to for all ma- 
trix rows corresponding to the overlap region /|+^ \/'+^ Due to the structure 
of the matrices in the hierarchy all matrix-vector operations can be performed 
locally on each processor provided that the vectors are stored consistently. 
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3.3.1 Agglomeration on Coarse Levels 

Note that our aggregation Algorithm 4 does not build any aggregates that cross 
over the borders of our disjoint partitioning. On the fine level, we rely on the 
user (or third party software) providing our solver with a reasonable partitioning 
of the global matrices and vectors onto the available processes. Often this 
partitioning will not take weak connection in the matrix graph into account. 
If we would continue the coarsening until no further decoupled aggregation 
is possible, the local part of the global matrix would cither only have very 
few entries, entries mainly coupled by weak connections, e.g. for anisotropic 
problems, or both. This would cause a break down of the coarsening rate and/or 
very few computations between communication steps with small messages. 

The currently available supercomputers, like the Blue Gene type systems of 
IBM, already make hundreds of thousands of cores available for usage. Even if 
the coarsest level system only has a few unknowns per core, thc^ global system 
still has several hundred thousand unknowns. Solving such a coarse level sys- 
tem in parallel would mean doing very few floating point operations between 
many commimication steps. Therefore this computation would be limited by 
the available bandwidth and latency of the communication network. 

To overcome these problems, there is the option to agglomerate the data 
onto fewer processes. If agglomeration is not activated, all processes will com- 
pute on the coarsest level and a parallel Krylov method preconditioned with 
the smoother is used as a solver. Otherwise, whenever the average number of 
unknowns per core on a Icvcil drops below the prescribed coarsening target a 
new partitioning is computed using Par METIS (cf. [23, 19]), a parallel graph 
partitioning software. We support two different choices for the input graph of 
the partitioning. The logically most reasonable is to use the weighted graph of 
the global matrix. Its edge weights are set to 1 for edges that are considered 
strong by our strength of connection measure and to otherwise. This tells the 
graph partitioning software that weak connections can be cut at no cost and 
leads to partitionings that should keep small connected regions on one process. 
We believe that this approach results in sufficient coupling of strongly connected 
unknowns on coarser grids. 

Unfortunately, at least as of version 3.1.1 Par METIS uses a dense array of 
size i^V X internally to capture all possible adjacencies between processes. 
This results in running out of memory on systems with very many cores (like 
the IBM Blue Gene/P). To prevent this we use the vertex-weighted graph of 
the communication pattern used in the parallel linear operator as input. Each 
process represents a vertex in the graph. The weight of the vertex is the number 
of matrix rows stored on this process. Edges appear only between pairs of 
vertices associated with processes that exchange data. This graph is gathered on 
one master process and the repartitioning is computed with the recursive graph 
repartitioning routine from sequential METIS. Then the data (matrices and 
vectors) of all processes associated with vertices in one partition, is agglomerated 
on one process and the others become idle on coarser levels. Obviously this is 
a sequential bottleneck of our method. It will improve once massively parallel 
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graph partitioning tool become available. 

This kind of agglomeration is repeated on subsequent levels until there is only 
one participating process on the coarsest level. We can now use a sequential 
sparse direct solver as coarse level solver. 

In Figure 1, the interplay of the coarsening and the data agglomeration 
process is sketched. Each node represents a stored matrix. Next to it the 
level index is written. As before the index denotes the finest level. Note 
that on each level, where data agglomeration happens, some processes store two 
matrices, an agglomerated and a non-agglomerated one. The latter is marked 
with an inverted comma after the level number. 




H ^» 

1 number of processes in use P 



Figure 1: Data agglomeration 

Whenever data agglomeration happened, the parallel smoothers use the not 
yet agglomerated matrix. The agglomerated matrix is only needed for the coars- 
ening to the next level. 

4 Implementational Details 

The described algorithm is implemented in the "Distributed and Unified Nu- 
merics Environment" (DUNE) [7, 14, 5, 4]. As the components of this library 
are the main cause for the good performance of our method, we shortly intro- 
duce the two main building blocks of our AMG method: the parallel index sets, 
and the "Iterative Solver Template Library" (ISTL) [9, 10]. 

4.1 Parallel Index Sets 

Our description of the parallelization in Section 3 is based on parallel finite 
index sets. This natural representation is directly built into our software and 
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used for the communication. We will only shortly sketch the relevant parts of 
the implementation. For a complete description of the parallel index set software 
see [11]. 

Each process p stores for each level one mapping of the corresponding index 
set /(p) to {0, . . . , \I(p) \ — 1}- This mapping allows for using the efficient local 
matrix and vector data structures oflSTL to store the data and allows for direct 
random access. For every entry in /(p) an additional marker is stored that lets 

us identify whether the index belongs to J(p) \ /(p) or to /(p). The mapping 
is represented by a custom container, that provides iterators over the entries. 
The key type used for this mapping is not limited to builtin integers but can be 
any integral numeric type. This allows us to realize keys with enough bits to 
represent integers bigger than 1.3 • 10^^. 

Using these index sets all the necessary communication patterns are precom- 
puted. The marker allows us for example to send data associated with /(p) for 

every process p & V to all processes q £ V with /(^j n /(p) ^ 0. This kind of 
communication is used for the parallel application of the linear operator, the 
parallel smoother, and the communication of the aggregate information after 
the decoupled aggregation of one level. These communication patterns are im- 
plemented independent of the communicated type. The same pattern can be 
used to send for example vector entries of type double or the aggregate mimbers 
represented by an arbitrarily sized integer type. During the communication 
step we collect all data for each such pair p, q of processes in a buffer and send 
all messages simultaneously using asynchronous communication of MPI. This 
keeps the number of messages as low as possible and at the same time uses the 
maximal message size possible for the problem. This reduces negative effects of 
network latency. As described already in Subsections 3.1 and 3.2 only one such 
communication step is necessary for each application of the linear operator or 
smoother. For the three dimensional model problems of the next section each 
process sends and receives one message to and from at most eight neighboring 
processes. The size of the mesage is smaller than 152 kByte. 

It is even possible to use two different distributions Upg-p J{p) — ^ ^^'^ 
UpeQ^(p) = I 8tS source and target of the communication. Whenever data 
agglomeration occurred on a level, we have two such distributions with #7-" > 
#Q. To collect data we send data associated to /(p) for every process p gV to 
all processes q £ Q with Ig n /(p) ^ 0, when gathering data to fewer processes. 

4.2 Efficient Local Linear Algebra 

The "Iterative Solver Template Library" (ISTL) [9] is designed specifically for 
linear systems originating from the discretization of partial differential equa- 
tions. An important application area are systems of PDEs. They often exhibit 
a natural block structure. The user of our method can either neglect this block 
structure like with most other libraries. The linear system is then simply re- 
sembled by a sparse matrix with scalar entries. In addition our method also 
supports a block-wise treatment of the unknowns, where all unknowns associ- 
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ated with the same discretization entity are grouped together. These groups 
must have the same size for all entities. The couplings between the grouped un- 
knowns are represented by small dense matrices. The unknowns themselves in 
small vectors. The size of the matrices and vectors is known already at compile 
time. 

ISTL offers specialized data structures for these and in addition supports 

block recursive sparse matrices of arbitrary recursion level. Using generic pro- 
gramming techniques the library lets the compiler optimize the code for the data 
structures during compilation. The available preconditioners and smoothers are 
implemented such that the same code supports arbitrary block recursion levels. 

Therefore our method naturally supports so-called point-based AMG, where 
each matrix entry is a small dense matrix by itself. The graph used during 
the coarsening in Section 2.1 is the graph of the block matrix and the weight 
functions used in the criterions (3) and (4) are functions that turn the matrix 
blocks into scalars, such as the row-sum or Frobenius norm. The user only has 
to select the appropriate matrix and vector data structures and the smoother 
automatically becomes a block-smoother due to generic programming with tem- 
plates. 

5 Numerical Results 

In this section we present scalability results for two model problems. First we 
solve simple Laplace and Poisson problems. Then we take a look at a hetero- 
geneous model problem with highly variable coefficients. We perform our test 
on two different hardware platforms: a super-computer from IBM and a recent 
of-the-shelf Linux cluster. 

The first machine is JUGENE located at the Forschungszentrum in Jiilich, 
Germany. JUGENE is a Blue Gene/P machine manufactured by IBM that 
provides more than one petaflops as overall peak performance. Each compute 
node uses a 850 MHz PowerPC 450 quad-core CPU and provides 2GB of main 
memory with a bandwidth of 13.6 GB/s. The main interconnect is a 3D-Torus 
network for point to point message passing with a peak hardware bandwidth 
of 425MB/s in each direction of each torus link and a total of 5.1 GB/s of 
bidirectional bandwidth for each node. Additionally there are a global collective 
and a global barrier network. For comparison we also performed some tests on 
helicsSa at Heidelberg University, an of-the-shelf Linux cluster consisting of 32 
compute nodes with four AMD Opteron 6212 CPUs providing eight cores, each 
at 2.6 GHz. Each node utilizes 128 GB DDR3 RAM at 1333Mhz as main 
memory. The Infiniband network interconnect is a Mellanox 40G QDR single 
port PCIe Interconnect QSFP with 40 GB/s bidirectional bandwidths. 

We start the analysis of our method by solving the Laplace equation, i.e. 
K = I, with zero Dirichlet boundary conditions everywhere. The results of the 
weak scalability test can be found in Table 1. For the discretization we used a 
cell-centered finite volume scheme with 80"^ cells per participating core. Note 
that the biggest problem computed contains more than 1.34-10^^ unknowns. 
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The problems are discretized on a structured eube grid with uniform grid 
spacing h. We use one step of the V-cycle of the multigrid method as a pre- 
conditioner in a BiCGSTAB solver. For pre- and post-smoothing we apply one 
step of hybrid symmetric Gauss-Scidcl. We measure the number of iterations 
(labelled It) to achieve a relative reduction of the Euclidian norm of the residual 
of 10~^. Note that in each iteration the preconditioner is applied twice. We 
measure the number of grid levels (labelled lev.), the time needed per iteration 
(labelled Tit), the time for building the AMG hierarchy (labelled TB), the time 
needed for solving the linear system (labelled TS), and the total time needed to 
solution (labelled TT) including setup and solve phase depending on the num- 
ber of processors (which is proportional to the number of grid cells 1/h cubed). 
Time is always measured in seconds. 
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4.93 
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137.1 
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10 


81.31 


64.45 


13 


4.958 


145.8 


32768 


2560 


11 


92.75 


65.55 


13 


5.042 


158.3 


262144 


5120 


12 


188.5 


67.66 


13 


5.205 


256.2 



Table 1: Laplace Problem 3D on JUGENE: Weak Scalability 



Clearly, the time needed per iteration scales very well. When using nearly 
the whole machine in the run with 262,144 processes, we still reach an efficiency 
of about 77%. Due to the slight increase in the number of iterations the effi- 
ciency of the solution phase is about 47%. Unfortunately, the hierarchy building 
does not scale as well. This has different components, which can best be dis- 
tinguished by an analysis of the time needed for some phases of the coarsening 
with agglomeration, which is displayed in Table 2. 
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12 
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98.23 


4.65 


85.71 



Table 2: Time needed for Agglomeration and Coarsening(Laplace 3D) 

In the table the column labelled "no." contains the number of data agglom- 
eration steps, the column labelled "TG" contains the time spend in preparing 
the global graph on one process, partitioning it with METIS, and creating the 
communication infrastructure. The column "TM" contains the time needed for 
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redistributing the matrix data to the new partitions, and the cohimn labehed 
"TR" contains the total time needed for the rest of the coarsening including the 
time for the factorization of the matrix on the coarsest level using SuperLU. If 
we directly agglomerate all data to one process, wc do not use METIS as the 
repartitioning scheme is already known in advance. With an increasing number 
of processes the time spent for computing the graph repartitioning increases 
much faster than the time needed for the redistribution of the data. It turns 
out that this is one of the main bottlenecks, especially with very high proces- 
sor numbers. However, there is also a markable increase in the time needed 
for the coarsening process itself. Without agglomeration the creation of the 
coarsest matrices would take less and less time as the total number of entries 
to be aggregated decreases. However, after each redistribution step the number 
of matrix entries per processor still participating in the computation step is 
increasing again. Thus the build time has a log{P) dependency. The time TR 
needed for the coarsening increases much more whenever an additional level of 
agglomeration is needed. 
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Table 3: Poisson Problem 3D on helics3a: Weak Scalability 



We perform a slightly modified test on helics3a, where the Poisson problem 
is solved, described by 

-Au = (6-4||a;||)ell^ll in f2 = (0, l)^ 

u = ell^ll on on 

As helics3a has more main memory per core it is possible to use a grid which 
has more than eight times as many grid cells per core than on JUGENE. 

The results of the weak scalability test can be found in Table 3. Please 
note that despite the larger problem per process the same number of levels in 
the matrix hierarchy as before is constructed. This is equivalent to an eight 
times larger problem on the coarsest level. The build time scales much better 
under these circumstances. This has two reasons. First, the size of the coarse 
grid problem after agglomeration is smaller compared to the large number of 
unknowns per processor. Secondly, the fraction of TR needed for the matrix 
decomposition is higher due to the larger matrix on the coarsest level, which 
reduces the influence of the graph partitioning and redistribution. 

On helics3a with its much faster processor cores compared to JUGENE, the 
memory bandwidth becomes the limiting factor for the solution phase. While 
for the case of eight processes it would in principle be possible to distribute the 
processes in a way that each process still has full memory bandwidth, we did 
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not exploit this possibility as it is very tedious to achieve such a distribution 
and as it is no longer possible for the case of 64 or more processes anyhow. With 
the automatic process placement of the operating system processes will share a 
memory controller already in the case of 8 processes, which is reflected in the 
notable increase of the time per iteration. As expected the time per iteration is 
only slightly increasing when using even more processes. The hierarchy building 
is much less affected by the memory bandwidth limitation. 

In addition we perform a strong scalability test on helicsSa where the total 
problem size stays constant while the number of cores used increases. In this 
test we use decoupled coarsening until we reach the coarsening target and then 
agglomerate all the data onto one process at once and solve the coarse level 
system there. The results can be seen in Tables 4 and 5. Note, that when using 
512 processes our method still has an efficiency of 27%. Again for the time 
needed for the solution phase (column TS) the biggest drop in efficiency occurs 
when using eight instead of one core due to the limited memory bandwidth. The 
setup phase (column TB) is not limited as much by it and scales much better 
than on JUGENE. 
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Table 4: Poisson Problem 3D on helics3a: Strong Scalability 
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64 


0.79 


0.51 


0.77 


0.59 


512 


0.27 


0.28 


0.37 


0.28 



Table 5: Poisson Problem 3D on helics3a: Strong Efficiency 

The last model problem we investigate is the diffusion problem 

V • (fc(x)Va;) = / 

on the unit cube [0, 1]'^ with Dirichlet boundary conditions and jumps in the 
diffusion coefficient as proposed in [17]. Into the unit cube a smaller cube with 
width .8 is centered such that all faces are parallel to the faces of the enclosing 
cube. The diffusion coefficient k in this smaller cube is 10"^. Outside of the 
small cube A; = 1 holds except for cubes with width 0.1 that are placed in all 
edges of the imit cube. There the diffusion coefficient is 10^^. Again we use a 
cell-centered discretization scheme and the same settings as before for the AMG. 
The results of a weak scalability test on 64 racks of JUGENE can be found in 
Table 6. 
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Table 6: Heterogeneous Diffusion Problem 3D on JUGENE: Weak Scalability 

Compared to the Poisson problem the number of iterations increases more 
steeply due to the jumps in the diffusion coefficient. Again this is not due to 
the parallelization hut due to the nature of the problem. The time needed for 
building the matrix hierarchy as well as the time needed for one iteration scale 
as for the previous problems. Compared to the AMG method used in [17] on an 
(now outdated) Blue Gene/L our method scales much better when used on Blue 
Gene/P. In small parts this might due to the new architecture and the bigger 
problem size per core used. But this cannot explain all the difference in the 
scaling behavior. Additionally, the large problem size is only possible because 
of the smaller memory foot-print of our method. 

6 Summary and Conclusion 

We have presented a parallel algebraic multigrid algorithm based on non-smoothed 
aggregation. During the setup phase it uses an elaborate heuristic aggregation 
algorithm to account for highly variable coefficients that appear in many ap- 
plication areas. Due to its simple piecewise constant interpolation between the 
levels, the memory consumption of the method is rather low and allows for 
solving problems with more than lO^'^ unknowns using 64 racks of an IBM Blue 
Gene/P. The parallelization of the solution phase scales well for up to nearly 
300,000 cores. Although there is a sequential bottleneck in the setup phase of 
the method d\ie to the lack of scalable parallel graph partitioning software, the 
method still scales very well in terms of total time to solution. For comparison 
see [2] where during a weak s(;alability test for the Laplace problem on an IBM 
Blue Gene/P the total solution time for interpolation AMG increases by more 
than a factor 2 when going from 128 to 128,000 processes. In contrast, for our 
method with the same increase in total solution time we can go from 64 to up 
to 262,144 processes during weak scaling. 

We also have shown that our solver scales reasonably well even for hard prob- 
lems that have highly variable coefficients. Even for modern clusters consisting 
out of multicore machines the method scales very well and is only limited by 
the available memory bandwidth per core. 

Once scalable parallel graph partition software is available, the bottleneck 
of the sequential graph partitioning will disappear rendering the method even 
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more scalable. 
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